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Abstract 

The separability assumption ( Donoho fc Stoddeij . l2003tlArora et al. . 2012 ) turns non- negative 



matrix factorization (NMF) into a tractable problem. Recently, a new class of provably-correct 
NMF algorithms have emerged under this assumption. In this paper, we reformulate the separa- 
ble NMF problem as that of finding the extreme rays of the conical hull of a finite set of vectors. 
From this geometric perspective, we derive new separable NMF algorithms that are highly scal- 
able and empirically noise robust, and have several other favorable properties in relation to 
existing methods. A parallel implementation of our algorithm demonstrates high scalability on 
shared- and distributed-memory machines. 



1 Introduction 

A data matrix X of size m x n is said to admit a Non-negative Matrix Factorization (NMF) 
with inner-dimension r, if X can be expressed as X = WH where W, H are two non-negative 
matrices of dimensions m x r and r x n respectively. In many applications, a compact (i.e., small 
r) approx imate NMF t e nds t o provide a natural and interpretable part-based decomposition of 
the data (iLee &: Seung . often more appealing than other low-rank factorizations. NMFs 

arise pervasively in a variety of signal separation problems, such as modeling topics in text and 



hyperspectral image analysis ( Cichocki et al. . 20091 ) . 



Figure [J shows the geometry of the NMF problem. As a point cloud in M™ , all the n columns 
of X are contained inside a cone that is generated by r non-negative vectors in M'" comprising 
the columns of W. For any matrix A, let cone(A) denote the set obtained by taking linear 
combinations of the columns of A with non-negative coefhcients. Then, the goal is to find a non- 
negative matrix W, with just r columns, such that: cone(X) C cone(W) C M^, where M!p denotes 
the non- negative orthant in M™. Such polyhedral nesting problems studied in computational ge- 
ometry a re known to b e NP-hard, which makes the exact and approximate NMF problem also 
NP-hard (IVavasisl. 120091) . Faced with such results, a l most t he entire algorithmic focus in the NMF 
literature, e.g.. dCichocki et all I2OO9I : Ibee fc Seunl Il999l : IPhJ . l2007l : iHsieh fc Dhillonl . l201lh . has 
centered on treating the problem as an instance of general non-convex programming, leading to 



*T]iis work was done when AK was a summer intern at IBIVI Research. 
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Figure 1: Geometry of the NMF Problem. Separability implies that data is contained in a cone 
generated by a subset of r "anchor" points (red squares). 

heuristic procedures that lack optimality guarantees beyond convergence to a stationar y point of the 



objective function for approximate NMF. Very recently, in a serie s of elegant papers (lArora et al 



20121 : iBittorf et aP . 120121 : lOillis fc VavasieJ . I2OI2I : lEsser et alJ . I2OI2I : lElhamifar et al.l . [201^), promis- 
ing alternative approaches have been developed based on certain separability assumption on the 
data which enables the NMF problem to be solved exactly. Geometrically, the assumption states 
the following: 

Definition 1.1. Separability Assumption: The entire dataset, i.e. all columns ofJi., reside in 
a cone generated by a small subset of r columns ofJi.. 

In algebraic terms, X = WH = Xy^H so that the r columns of W are hidden among the 
columns of X (indexed by an unknown subset of indices A). Equivalently, a corresponding subset 
of r colu mns of H happen to constitute the r x r identity matrix. We refer to these columns as 
anchors ( Arora et al.l . 20121 ). Informally, in the context of topic modeling problems where X is a 
document-word matrix and W, H are document-topic and topic-term associations respectively, the 
separability assumption equivalently posits the existence of special anchor words in the vocabulary, 
whose occurence uniquely identifies the presence of a topic, and whose usage across the corpus is 
collectively predic tive of the usage of all the other words. The separability assumption was inves- 
tigated earlier by Donoho Sz Stodden ( 20031 ) who showed that it implied uniqueness of the NMF 
solution, modulo permutation and scaling. In order to place our contributions in the right context, 
we first briefly provide a flavor of recently proposed separable NMF algorithms. 



Related Work: Assuming that the columns of X are normalized to have unit /i-norm, the 
separable NMF problem reduces to that of finding the extreme points (that is, points inexpress ible 



as convex combinations of other points) of the convex hull of the columns (jArora et al.l . l2012l ). A 



Linear Program (LP) can be setup to attempt to express a given column as a convex combina- 
tion of th e other columns. If this LP declares infeasibility, an extreme point is identified. This 
approach (jArora et al.l . |2012| . Section 5) requires solving n feasibility LP's each involving n — 1 vari- 
ables which is not scalable for many problems of interest. A noise-robust vers ion of the procedure 



further requires knowledge of parameters that are hard to estimate apriori. iBittorf et al.l (|2012l ) 



formulate a single LP whose solution resolves the exactly separable NMF problem. An extension 
is also developed for noise-robustness. Instead of invoking a general LP solver, a specialized al- 
gorithm is derived based on an incremental stochastic gradient descent procedure, and its parallel 
(multithreaded) implementation is benchmarked on large datasets. On the other hand, this al- 
gorithm requires estimates of primal and dual step sizes, converges only a symp totically, and does 
not explicitly exploit the sparsity of the final solution. Gillis Sz Vavasi j ( 2012 ) develop a highly 
scalable approach closely related to rank-revealing QR factorizations for column subset selection. 
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A perturbation analysis of this algorithm under noise is also presented. In Esser et al. ( 20121 ). 
column subset selectio n is cast essentiall y as a form of multivariate regression with row-sparsity in- 



ducing norms, e.g., see lBien et al 



Algorithms derived in this framework are asymptotically 
convergent, and sensitive to near-duplicate columns, making it necessary to perform certain adhoc 
preprocessing steps. 



Contributions: We present a new family of highly scalable and empirically noise-robust algo- 
rithms for separable NMFs, with several favorable properties: 

o The algorithms produce a correct solution for the separable case after exactly r iterations. They 
require no additional parameters. They are closely related t o convex and conical hull findiii g 



procedures proposed in the computational geometry literature ( Clarkson . 1994 : Dula et al. . 19981 ) 



Comput ationally, the algorithms bear some resembl ence to simultaneous Orthogonal Matching 
Pursuit (|Buhlmann fc Geeil . lioinl : iTropp et aLl . l2nnfil ) for sparse greedy reconstruction of multiple 
target variables from the same subset of input variables. We also derive a variant based on this 
connection that performs quite well under noise. 

Under controlled noise conditions in synthetic datasets and on real-world topic modeling prob- 
lems, our algorithms consistently outperform other separable NMF techniques with respect to 
multiple performance metrics. Our methods are highly competitive with existing non-convex 
NMF algorithms, but are free of sub-optimal local minima and associated initialization issues. 
The solution for (r — 1) target anchors is contained in the solution for r target anchors (unlike 
non-convex NMF methods), which makes it easier to do model selection on real- world datasets 
by keeping track of performance on a validation set. 

The algorithms are highly scalable and have small memory footprint. The sparsity of the data, the 
intermediate variables and the final solution is carefully exploited in a high-performance parallel 
and distributed implementation which scales excellently on both shared- and distributed-memory 
machines. For example, a twitter corpus with 125-thousand tweets can be factorized for r = 100 
in less than 10 seconds on a commodity 8-core machine. 

Unlike all existing algorithms, no column normalization is needed. Such normalization interferes 
with the TFIDF weightings routinely used in text modeling applications, leading to performance 
loss. 

Unlike lEsser et al.l toli ) , the algorithms do not require any special preprocessing to eliminate 
duplicate or near-duplicate columns. 



2 Fast Conical Hull Algorithms 

An informal description: Figure [2] provides some geometric intuition underlying the proposed 
approach. The algorithm executes r iterations. In each iteration a new anchor column is identified. 
This corresponds to expanding a cone one extreme ray at a time, until the entire dataset is eventually 
contained in the cone defined by the full set of anchors. Figure [2] illustrates one step of the algorithm 
where there is an existing cone defined by three extreme rays (marked 1 to 3). To identify the next 
extreme ray, the algorithm picks a point outside the current cone (a green point) and projects it 
to the current cone to compute a residual vector (we call this the projection step). This residual 
vector separates the current cone from at least one non-selected extreme ray that can be found by 
maximizing a specific selection criteria (we call this the detection step). Intuitively, the algorithm 
picks a face of the current cone (spanned by rays 1 and 3 in Figure [2|) that "sees" exterior points 
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and rotates this face towards the exterior until it hits the "last" point. In the example shown in 
Figure [21 ray 4 is identified as a new extreme ray. 




Figure 2: Geometry of the Conical Hull algorithms 



These geometric intuitions are inspired by IClarksonI ^1994 ): lOula et al.l (1 19981 ^ who present LP- 



based algorithms for general convex and conical hull problems. Their algorithms are also directly 
applicable in our NMF setting, provided the data satisfies the separability assumption exactly. In 
this case, the residual of any single exterior point can be used to correctly expand the cone as 
described above. However, anchor detection criteria derived from multiple residuals demonstrates 
radically superior noise robustness, as we report in the experimental section. The emphasis on 
scalability and noise-robustness thus leads us to a new family of algorithms whose implementation 
(and associated proof of correctness) is distinct from prior work. 

Cones, Extreme Rays and Projection: Here, we provide a short background on relevant 
geometric concepts and set some notation. Recall that a cone C is a non-empty convex set that 
is closed with respect to taking conic combinations (i.e., linear combinations with non-negative 
coefficients) of its elements. A ray in C generated by a vector x 7^ G C is the set of all vectors 
{tx : t > 0}. A ray R is an extreme ray if its generators cannot be expressed by taking conic 
combinations of elements in C that do not themselves belong to R. A cone is called finitely gen- 
erated if its elements are conic combinations of a finite set of vectors, and pointed if it does not 



contai n both a vector x as well as its negation — x. A fundamental result (e.g., see iNemirovski 



states: a pointed, finitely generated cone C possesses a finite and unique set of extreme 



rays, and C is the conical hull of the generators of these extreme rays. Furthermore, the genera- 
tors of these extreme rays are a subset of the finite set of vectors used to originally express the 
cone. In the NMF context, note that any cone contained in MJ^ is pointed. This implies that 
cone(X) can also be described by a minimally compact set of generators, i.e., cone(X) = cone(X.A) 
where A uniquely indexes the extreme rays (anchors). Thus, a non-negative matrix X admits a 
separable NMF with inner-dimension r if the number of extreme rays of cone(X), i.e. size of A, 
coincides with r. A face of a cone is the intersection between the cone and a supporting hyperplane. 
The projection of a point x onto the cone generated by columns of a matrix W, i.e. computing 
z* = arg min^g^o„g(-v^-) ||x — zjlg, can be obtained by solving a non- negative least squares problem, 
i.e., computing h* = argmin/j>Q ||x — W/ilH and setting z* = W/i*. All columns of X can be 
simultaneously projected by solving H* = argminjj>o 11^ ~ We will use the notation R 

to denote the residual matrix after a projection operation, i.e., R = X — WH*. We will use the 
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Algorithm 1 Xray : Algorithms for Separable NMF 



Input: X e M" separable with inner dimension r 
Output: W G Kmxr, H G Mrxn, T indiees in A 

such that: X = WH, W = X^ 
Initialize: R ^ X, A ^ {} 
while < r do 

1. Detection Step: Find an extreme ray. 

Below, p is a strictly positive vector (not collinear with 



Some exterior point selection criteria: 



R^X 

j* = argmax ' for any z : ||Rj||2 > (1) 

j P -^3 



rand : any random i : ||Ri||2 > (2) 

max : i = argmax^^ IIR-fclb (3) 

dist : i = argmaxfe || (R^^X)^ h (4) 

II (R,^X ) II ^ 

A greedy variant: j* — argmax^ ||Xj"i|^"^ ^'^^ 



2. Update: A <— A U {j*} (see Remarks) 

3. Projection Step: Project onto current cone. 



H = argmin||X-XAB||2 (Algorithm 2) (6) 

B>0 



4. Update Residuals (not explicitly): R = X — X^H 
end while 



notation Xj,Rj to denote the i column of X and its corresponding residual. The notation q_|_ 
will denote the vector obtained by setting all negative entries of the vector q to 0. 



2.1 Algorithm Description, Correctness Result and Variations 

Algorithm [T] details the steps of the proposed family of algorithms which we call Xray . Each 
iteration consists of two steps: (i) a detection step that finds a column(s) of X to be added as an 
anchor, and (ii) a projection step where all data points are projected onto the current cone to get 
the residuals. Projection is done by solving simultaneous nonnegative least squares problem using 
Algorithm [2l Every residual vector Rj obtained after the projection step is normal to one of the 
faces of the current cone. In the selection step, we pick a face of the current cone (identified by 
™. — .a. .0.. .0 „e o„ „„e p^. . . (v, . ^) . 

strictly positive vector p, and expand the current cone by selecting an extreme ray that maximizes 
the inner product HfYj. The selection step can be implemented in various ways - some options 
are listed in Algorithm [TJ 

To show that Xray correctly identifies all the extreme rays, we need the following lemmas. 

Lemma 2.1. The residual matrix R, obtained after projection of columns of ^ onto the current 
cone satisfies H'^'Ka ^ 0, where "Ka o-i"^ the extreme rays of the current cone. 
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Algorithm 2 Solver for: argming>Q ||X — X^B^ 

Input: X g K"'^", Index set A with r indices, Initial value for warm-starts: 'Binit £ K'' 

Convergence paramaters tol, maxcycles 

Initialize: B = 'Binit 

Set: C = X^X G R"><" 

S = Ca^a e K''^'' 

s = diag(S) 

U = B^S e R"^'- 

while true do 

for i = I . . .r {//cyclic coordinate descent) do 

b = ^ = (B^), 

u = Uj Sib 

b = s-i(Q-u)+ 

5 = h-8 

U = U + ^S^ // sparse rank-1 update 
(B^)« = b 
end for 

objective = ||X||2 + ELi(U, + CO^(B^)^ 
Exit if Aobjective < tol, or #iters > maxcycles. 
end while 



Proof. Residuals are given by R = X — X^H, where H = arg ming>Q||X — X^B||p. 
Forming the Lagrangian for Eq. [6l we get L(B, A) = ||X — X^B||p — tr(A^B), where the matrix 
A contains the nonnegative Lagrange multipliers. Differentiating w.r.t. B and evaluating at the 
optimum B = H, we have the following from the KKT conditions: 2X^(XyiH — X) — A = 
-2X5R = a > R^Xa < □ 

Lemma 2.2. For any point Xj exterior to the current cone, we have Rf Xj > 0, where Rj is the 
residual of Xj obtained by projecting it onto the current cone. 

Proof. Let R = X — Xy^H, where H = argming^QHX — X^BHp and X^ are the extreme rays of the 
current cone. From the KKT conditions (used in the proof of Lemma [2. ip we have 2R^Xyi = — A"^, 
where A are the Lagrange multipliers. Hence, 2R?"Xyi = —A.f (ith row of both left and right 
side matrices). From the complementary slackness property, we have AjiUji = Vj, i. Hence, 
2Rf X^Hi = -AjUi = 0. 

Hence we have Rf Xj = Rf (Rj + Xy^Hi) = ||Ri||^ + RfX^Hi = ||Ri||^ > since Rj 7^ 0. □ 

Using the above two lemmas, we prove the following theorem regarding the correctness of 
Algorithm [T] 

Theorem 2.1. The data point Xj* added at each iteration in the Detection step of Algorithm [Jl 
if the maximizer in Eqn. [1] is unique, is an extreme ray of C that has not been selected in previous 
iterations. 

Proof. Let the index set A identify all the extreme rays of C. Under the separability assumption, 
we have X = X^H. Let the index set A* identify the extreme rays of the current cone C*. 

Let Yj = ^Tx~ ^^'^ ~ ^A[diag{p^X.A)]~^ (since p is strictly positive, the inverse exists). 

Hence = y A^^^^^^^f^f^ . Let Cj = ['^^"^(pg^)]". _ ^e also have p^Y,- = 1 and p^Y^ = 1^. 
Hence, we have 1 = p^Yj = p^Yy^Cj = I'^Cj. 
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Using Lemma [2. 11 Lemma [2.2l and the fact that p is strictly positive, we have maxi<j<„ HfYj = 
maxj^^t KfYj. Indeed, for ah j G A* we have R*Yj < using Lemma [2. II and there is at least one 
j = for which R*Yj > using Lemma 12.21 Hence the maximum lies in the set {j : j ^ A^}. 

Further, we have max^^^t R?"Yj = max^^^t RT'Y^Cj < maxjg(yi\^t) R^Y^. The second in- 
equality is the result of the fact that ||Cj||i = 1 and Cj > 0. This implies that if there is a unique 
maximum at a j* = argmax^^^^t R^'Yj, then Xj* is generator of an extreme ray of the cone C. □ 

Remarks: (1) If the maximum occurs at two points and j^, both these points Xj* and Xj* 
generate the extreme rays of the cone C. Hence both are added to anchor set A. If the maximum 
occurs at more than two points, some of these are the generators of the extreme rays of C and others 
are conic combinations of these generators. We can identify the extreme rays of this subset of points 
by calling Algorithm 1 recursively and add them to anchor set A. (2) Note that the algorithm is 
not influenced by presence of repeated anchors. (3) In the Algorithm, the vector p simply needs 
to satisfy p"^Xj > 0,i = 1 . . . n. In our implementa t ion, w e simply used p = [!,...!] G M™, i.e., 
p-'^Xj = ||xj||i. (4) Note that unlike Gillis &: VavasisI ( 2012 ). we do not need Xa to be full-rank. 



Exterior Point Selection: It can be noted that residual of any point exterior to the current 
cone (i.e., any Rj ^ 0) can be used in the selection step of Algorithm 1. This gives us multiple ways 
of expanding the current cone depending on which i is chosen - all of which solve the separable 
problem but may behave very differently in the presence of noise. Some natural options are listed 
in Algorithm 1: choosing a random exterior point (Eqn. [2|), one with maximum residual norm 
(Eqn. [3|) or one which defines a normal to a supporting hyperplane of the current cone which "sees" 
maximum "mass" of points in its positive halfspace, as measured by Eqn. SI In the experiments, 
we will refer to these variants as Xray (rand), Xray (max) and Xray (dist) respectively. 

A Greedy variation for noisy data: In high dimensional noisy data almost all the points 
may masquerade as anchors. A natural choice is to expand the current cone greedily by select- 
ing a point that best minimizes the current residual, i.e., j* = argmin^ minb>o||R — ^jt>^|||'. 
This selection criterion simplifies to Eqnl5]in Algorithm 1 (referred as Xray (greedy) henceforth). 
One may view thi s approach as imp lementing a nonnegative variant of simultaneous orthogonal 
matching pursuit ( Tropp et al. . 20061 ). which is a greedy approach to the problem of sparse regres- 



sion of multiple response variables on the same subset of explanatory variables, i.e., for solving 
minB>o||X — XB|||, s.t. ||B||o,i = r where ||B||o,i pseudo-norm counts the number of non-zero 
rows in B. In the context of separable NMF, both response variables and ex planatory variable s 



are the columns data matrix X. A relaxed version of this problem is solved in lEsser et al.l (120121) 
(miriR>nl|X — XB|||, + A||B||i_oo)- It is also possible to have ||B||i^2 penalized variant (jTroppl . 



200fil : iBien et al.l . boid ) which is natural for sparse multivariate regression problems. Note that 



the greedy approach is not guaranteed to solve the separable NMF problem, but may perform well 
in the noisy settings as we observe in our experiments. Intuitively, this variant is concerned with 
greedily optimizing all residuals on average at every iteration, instead of making a decision based 
on the residual of a single, albeit well-chosen, exterior point. 

Solution Refinement and Model Selection: In practice, the separable solution (W, H) as 
obtained from Algorithm 1 may be further refined with a few steps of alternating optimization with 
respect to a divergence measure of interest (e.g., Frobenius reconstruction ||X — WH||p). Also, in 
real-world datasets, the value of r is typically unknown. Since our algorithms build the solution 
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one anchor at a time, r can be set based on a performance measure evaluated on held-out data. 
Alternatively, Algorithm 1 can exit if the amount of improvement from introducing a new anchor 
falls below a prespecified threshold. 

2.2 Scalability and Parallelization 

Here we describe various implementation details that allow us to gracefully scale to large sparse 
datasets (e.g., document-term matrices). The detection step can be parallelized by scoring the 
candidate anchors simultaneously. Likewise, the projection step involves solving Eqn. El which is 
separable in the columns of B and hence can be optimized in parallel. 

Detection Step: We avoid materializing the dense residual matrix R in the evaluation of 
the anchor selection criteria. Instead, we score candidate anchors on-the-fly as we compute (but 
not explicitly materialize) a matrix Q = (R^X)_|_ = (C — {CaH.)'^)_^_ where C = X-^X denotes 
a covariance matrix (word- by- word for topic modeling applications). Here, the potential sparsity, 
symmetry of the covariance matrix C as well as the non-negativity of H can be further exploited. 
For example, if Cij = 0, the corresponding entry in the product (C^H) need not be computed, 
since the resulting negative value is anyway reset to zero by the (■)-)_ thresholding operator. On a 
P core machine, the selection criteria may be evaluated in 0( "'"^p*"^^ ) time where nnz(C) is the 
number of non-zeros in C. If C is dense, we compute Q using parallel dense BLAS-3 operations. 
The one time computation of C is done via a parallel aggregation of rank-one outer-product terms 
defined by the rows of X. 

Projection Step: Algorithm 2 gives the steps of a cyclic block coordinate descent algorithm or- 
ganized around very light-weight incremental sparsity-exploiting updates for solving Eqn. [6] (deriva- 
tion omitted for brevity). The algorithm can be invoked in parallel on columns of X to compute 
the corresponding columns of B. The previous value of B is used to warm start the optimization 
and typically a very small number of iterations is needed for convergence. 



3 Empirical Observations 



Here, we report extensive comparisons on synthetic and medium-scale topic modeling problems, 
and benchmark our parallel implementation on large text data sets on multicore m achines and 
distributed syste ms. We compare with the methods proposed in Bittorf et al. ( 20121 ) (abbrv. as 
Hottopixx ) and iGillis &: VavasisI (120121) (abb rv. as GV), as well as traditional NMFs based on 



alternating optimization ( Cichocki et al. . 20091 ). The sourc e codes for Hottopi xx and GV were taken 



from the respective authors' websites. In comparisons with lEsser et al.l ( 20121') . it was observed that 



(|2012l V This characteristic 



it tends to select near-duplicate anchors, as also mentioned in lEsser et al. 
causes it to consistently perform less favorably compared to other methods unless the data is 
preprocessed in an adhoc fashion to remove simila r columns of X; he nce we do not include it in 
our list of baseli nes. We also do no t compare with lArora et al. I (EH) since Hottopixx reportedly 
performs better ( Bittorf et al. . 20121 ) and the algorithm requires parameters which are hard to guess 
apriori. 
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3.1 Synthetic experiments 

We perform a synthetic experiment that injects controhed amount of noise to corrupt the separable 
structure. Each entry of the matrix W € M^^*'^^'^ is generated i.i.d. according to a uniform 
distribution between and 1. The matrix H € jR^^^io -g ^^j^gj^ [I20X20 H'] where each 

column of H' G M^'^^^^'^ is generated according to a Dirichlet distribution whose parameters are 
chosen uniformly in [0, 1]. The data matrix X is set to WH + N where each entry of noise matrix 
N is generated i.i.d. according to a Gaussian distribution with zero mean and std. dev. 6. Fig. [3] 
plots the fraction of correctly recovered anchors (averaged over 10 runs for each value of 5) against 
the noise level 5 ranging from to 1.5. The proposed Xray (max) shows the best noise-robustness 
in terms of anchor recovery, followed by Xray (dist) and GV. Although Xray (greedy) does not 
perfectly resolve the separable NMF problem (5 = 0), it performs better than Hottopixx and is 
competitive with GV for near-separable case {6 > 0). As described below, on real datasets it 
turns out to be highly competitive. Xray (rand), although solves the separable problem {6 = 0), 
degrades significantly under noise, which shows that proper selection of an exterior point to expand 
the current cone is crucial for noise-robustness. 
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Figure 3: Anchor recovery rate versus noise level (best viewed in color) 



3.2 Medium-scale Topic modeling problems 

We evaluate the proposed method s on thr ee human-labeled text datasets that ar e commonly used in 



topic modeling literature: TDT-2 (lTDT2l l (m = 9394, n = 19528, r = 30), BBC (IGreene fc Cunningham . 



20061 ^ (m = 2225, n = 9635, r = 5) and Reuters (jReuterd ) (m = 7285, n = 18221, r = 10). We used 



standard tf-idf representation with document frequency thresholding in constructing the data ma- 
trix X. As required in Hottopixx and GV, we use £i-normalized columns of X (referred as matrix 
X(^i) henceforth) to identify the anchor column indices A^^^\ and use the unnormalized data X 
(and the corresponding anchor columns X^{f^)) for classification and clustering tasks. The use of 

X^^^ in clustering and classification (for any index set A) resulted in significantly worse perfor- 
mance uniformly for all methods so these results are not reported. For the sake of clarity in the 
figures, we do not show the results for Xray (max) which performed almost similar to Xray (dist) 
in these experiments. 

Classification experiments: Figure |4] shows the classification accuracy results obtained with 
the features (columns of the document-term matrix restricted to anchor words) selected by dif- 
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SVM accuracy with selected features (TDT) 



SVM accuracy with selected features (BBC) 



SVM accuracy with selected features (Reuters) 
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Figure 4: Classification Accuracy using selected features on TDT, BBC and Reuters datasets (best 
viewed in color) 
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Iterations 
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Figure 5: Clustering Performance on TDT, BBC and Reuters datasets (best viewed in color) 



ferent methods on the three datasets. Black dotted line is the classification accuracy with full 
features (all the words). We use 5% of the documents for training and the rest 95% for testing to 
emulate a semi-supervised learning scenario where we view various methods as inducing a topical 
representation b ased on all (unla beled) data. We use multiclass SVM classifier as implemented 
in LIBLINEAR ( Fan et all . 20081 ) and use four- fold cross validation to select the parameter C. 
Among separable NMF techniques, the proposed Xray (greedy) and Xray (dist) (with exception 
on Reuters) outperform Hottopixx and GV on all the three datasets, more so on TDT. On average, 
traditional NMFs with local optimization perform quite well on these datasets especially when r 
is small, but can show significant performance variance (shown as error bars) with respect to ini- 
tialization. As the number of topics increases, the performance gap between the proposed methods 
and the local optimization method rapidly diminishes. In this regime our techniques are a viable 
alternative to local optimization methods, and have the advantage of being local-minima-free, i.e., 
eliminating uncertainty with respect to initialization and therefore not requiring multiple runs. 



Clustering experiments: We also evaluated clustering performance by assigning a cluster 
label to each document based on the maximum element in the corresponding row of W. We re- 
fine the solution with a few iterations of alternating optimization. Figure [5] shows the clustering 
performance in terms of Normalized Mutual Information (NMI) as these iterations proceed. We 
also show the NMI obtained with local search method after it has converged to a local optimum 
(averaged NMI from ten runs with different random initializations is shown; error-bar indicates the 
variation around the average). Again, the proposed Xray methods are among the best performing 
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methods in terms of clustering performance and do not require multiple runs as traditional NMFs do. 



Effect of column normalization: In text processing, tf-idf features are popular due to their 
good empirical performance in va rious tasks. Howeve r , most of the previousl y proposed rnethod s 
for the separable NMF problem dBittorf et all boij iGillis k Vavasid . I2OI2I : lArora et all l2012l ) 
require the columns of X (i.e, words for text data) to be ii normalized, which can disturb the 
tf-idf structure. We conduct a small experiment to study the effect of word normalization on the 
prediction performance of the proposed methods. We identify anchor sets A^^'^\ A^^^^ and A by 
the proposed methods using the data matrices X^^^^ (^1 -normalized columns), X^^^^ (^2-iiormalized 
columns) and X, respectively and use X^(fi), X^{<?2) and X^ for classification (same SVM setup as 
described earlier). Table [1] shows the classification accuracy for 100 topics on the three datasets. 
These empirical results suggest that ii normalization of words on top of tf-idf features, as required 
by other separable NMF methods, can actually adversely affect the predictive quality of the selected 
anchors. 



Table 1: Effect of word normalization. The numbers are classification accuracies with selected 
features for r = 100. 





Xray (dist) 


Xray (greedy) 


h 


h 


None 


h 


h 


None 


TDT 


21.06 


83.04 


84.52 


31.69 


90.05 


91.87 


BBC 


58.59 


84.36 


87.73 


75.41 


82.18 


87.82 


REUT 


51.88 


76.88 


64.46 


55.65 


81.28 


83.02 



Table 2: Anchors (red) and top keywords for a few sample topics from TDT dataset. 



Xray (dist) 


le"winsky;nionica;grand;jury;starr;intern;white;cliiiton;housc;counscl 

iraq;weapons;baghdad;inspectors;iraqi;annan;council;military;inspections;sites 

tobacco;industry;senate;settlenient;snioking;bill;legislation;companies;billion;minnesota 

suharto;indonesia;indonesian;habibie;jakarta;president;riots;anti;reforms;resign 

shuttle;columbia;space;astronauts;nasa;niission;;crew;rats;aboard;experiments 


Xray (greedy) 


lewinsky;nionica;grand;jury;starr;intern;white;clinton;house;counsel 
iraq;weapons;baghdad;inspectors;inspection;inilitary;council;united;teani;strikc 
tobacco;industry;senate;settlenient;snioking;bill;legislation;companies;billion;minnesota 
suharto;indonesia;habibie;indonesian;jakarta;president;riots;anti;resign;refornis 
shuttle ; Columbia; space ; astronauts ; nasa; mission; crew ; rat s ; aboard ; experiments 


Gillis & Vavasis 


acknowledging;constitute;contact;physical;privately;intern;sexual;relationship;mattcr 

approves;warns;word;baghdad;deal;nations;iraq;united;approved;financing 

successive;alive;votes;checking;bill;supporters;dead;tobacco;stories;senate 

grows;secure;feels;worst;suharto;leave;hour;indonesia;country;crisis 

tall;astronauts;columbia;space;rotating;experiments;ear;inch;chair;stretch 


Hottopixx 


acknowledging;constitute;physical;intern;sexual 

ability;weapons;iraq;destruction;deny;tuned;develop;determined;mass;clinton 
accountable;tobacco;figliting;brands;bill;gop;legislation;desperately;intend;senate 
absence;worsened;turmoil;indonesia;political;suharto;leads;thinks;exercise;track 
approaching ;shuttle ; Columbia; space ; nasa; experiments ; extending ; astronauts ; weather 
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Table 3: Datasets used for large-scale experiments. Times are for r = 100 on eight cores on daniel. 



Name 


^documents 


#words 


nnz(X) 


nnz(X-' X) 


Sparsity(XJ X) 


Timc(r 100) 


Memory 


RCVl 


781265 


43001 


59.16e06 


172.3e06 


5% 


409 sees 


3.6 GB 


PPL2 


351849 


44739 


19.43e06 


1.99e09 


99.8% 


1147 sees 


30.2 GB 


IBMT 


124708 


25998 


1.03c06 


1.77e06 


0.2% 


9.8 sees 


1 GB 



Table 4: Running times of Xray versus Hottopixx on 8 threads for r topics and E epochs. 



Dataset 


Xray (sees) 


Hottopixx (sees) 


r=25 


r=50 


r=100 


E=5 


E=10 


r=25 


r=50 


r=100 


r=25 


r=50 


r=100 


IBMT 


0.38 


1.78 


9.8 


338.6 


337.2 


327.7 


642.1 


668.5 


636.9 


RCVl 


15.4 


67.2 


409 


2026.8 


1938.3 


1883.6 


3769 


3774.7 


3888.9 


PPL2 


196 


443.8 


1147 


1818.1 


1935.5 


1892.8 


3725.2 


3895.5 


3913.7 



Quality of anchor words: Qualitatively, we found that anchor words selected by the pro- 
posed Xray methods tend to be more representative of the topics compared to those selected by 
Hottopixx and GV. Table [2] shows top words and anchors for a few topics (Lewinsky scandal, Iraq 
nuclear program, National Tobacco Settlement, Indonesia riots of 1998 and Columbia space shuttle) 
extracted from the TDT dataset. 



3.3 Large-scale Experiments 

We implemented a shared- and distributed-memory parallel version of Xray in C++. That is, 
our implementation can exploit parallelism when running on multi-cor e machines, or on clust ers 
of multi-core machines. For shared- memory parallelism, we use PFunc (|Kambadur et al.l . l2009l ) 



a 



lightweight and portable library that provides C and C++ APIs to express task parallelism. For 
distributed-memory parallelism, we use MPfl, a popular library specification for message-passing 
that is used extensively in high-performance computing. 

To test the shared-memory performance and scalability of Xray , we ran experiments on 
daniel, a dual-socket, quad-core Intel® Xeon^M X5570 machine with 64GB of RAM running Linux 
Kernel 2.6.35-24 (total 8 cores). For compilation, we used GCC v4.4.5 with: "-03 -f omit-f rame- 
pointer -funroll-loops" in addition to PFunc 1.02, Op enMPI 1.4.5 and u ntuned ATLAS BLAS. 
We ran large-scale experiments on three dataset s: RCV l (|Lewis et al.l . 120041 1. co-occurence matrix 



of people and places from ClueWeb09 dataset ( Lemuil ) . and IBM Twitter (IBMT) dataset. The 



statistics relating to these three large datasets are presented in Table [3l We report scalability results 
for Xray (greedy) - other variants are computationally very similar. 

Figure [6] depicts the multi-threaded performance of our implementation on daniel while de- 
tecting 100 topics. Our implementation is able to factorize RCVl in 409 seconds on 8 cores and 
achieve 4.2x speedup over 8 threads when compared to the sequential implementation. Similarly, 
for IBMT we achieve 4.5x speedup, while completing the factorization in 9.8 seconds on 8 cores. 
For the dense X-'^X case, we are able to factorize PPL2 in 1147 seconds with just 8 cores. We 



^http: 7/www .mpi-f orum. org/"] 
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Multi-threaded Scalability 



S 3 




1 2 3 4 5 6 7 
Number of PFunc Threads WPI Processes = 1) 



Figure 6: Multi-core speedup of RCVl, IBMT, and PPL2 datasets when running on daniel. All 
times are for R=100. 

believe that further speedup improvements can be demonstrated on these problems by (a) opti- 
mizing the data layout of various sparse matrices to alleviate memory contention amongst threads, 
and (b) in dense problems such as PPL2, by using a version of BLAS tuned to our architecture 
and by reorganizing our implementation around more BLAS-3 operations that have better memory 
to compute ratio than BLAS-1 or 2 operations. Our implementation showed good scalability on 
distributed-memory machines as well (details omitted for brevity). 



T o compare our performance against the state-of-the-art Hottopixx algorithm (jBittorf et al. 
2012I ). we ran their algorithm on daniel with the options " — dual 0.01 — epochs 10 — splits 
8 — hott <R> — normse 1 — primal le-6" set in close consultation with the authors. A detailed 



comparison is shown in Table 13. 2i A head-to-head comparison is difficult because of the different 
performance characteristics of Hottopixx and Xray . For example, Hottopixx 's per-epoch runtime 
is not dependent on r, the number of topics, but it's accuracy is dependent on E, the number 
of epochs, while our methods execute exactly r iterations, where each iteration has a superlinear 
dependence on r. Nonetheless, for all three datasets, we see that Xray performs better than 
Hottopixx even when Hottopixx is run only for 5 epochs. In particular, for the sparse datasets 
IBMT and RCVl, Xray runs to completion in significantly shorter amount of time than Hottopixx 



4 Conclusions and Future Work 

Our methods perform favorably in comparison to other recently proposed separable NMF algo- 
rithms and offer highly scalable local-minima-free alternatives to existing local optimization tech- 
niques. Future work includes a formal noise analysis of the proposed algorithms, investigating the 
streaming setting where documents or words arrive in an online fashion, and using our models for 
social media content analysis. 
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